Abstract

Background: Protein domains represent the basic units in the evolution of proteins. Domain
duplication and shuffling by recombination and fusion, followed by divergence are the most
common mechanisms in this process. Such domain fusion and recombination events are predicted
to occur only once for a given multidomain architecture. However, other scenarios may be
relevant in the evolution of specific proteins, such as convergent evolution of multidomain
architectures. With this in mind, we study glutaredoxin (GRX) domains, because these domains of
approximately one hundred amino acids are widespread in archaea, bacteria and eukaryotes and
participate in fusion proteins. GRXs are responsible for the reduction of protein disulfides or
glutathione-protein mixed disulfides and are involved in cellular redox regulation, although their
specific roles and targets are often unclear.

Results: In this work we analyze the distribution and evolution of GRX proteins in archaea,
bacteria and eukaryotes. We study over one thousand GRX proteins, each containing at least one 
GRX domain, from hundreds of different organisms and trace the origin and evolution of the GRX 
domain within the tree of life. 

Conclusion: Our results suggest that single domain GRX proteins of the CGFS and CPYC classes 
have, each, evolved through duplication and divergence from one initial gene that was present in 
the last common ancestor of all organisms. Remarkably, we identify a case of convergent evolution
in domain architecture that involves the GRX domain. Two independent recombination events of 
a TRX domain to a GRX domain are likely to have occurred, which is an exception to the dominant 
mechanism of domain architecture evolution. 



Background 

Domain duplication and shuffling by recombination and fusion, followed by divergence are the
more frequent mechanisms for the evolution of proteins [1]. It has been estimated that such
recombination and fusion events are likely to occur only once for a given multidomain
architecture and that, after such an event, the fusion protein is duplicated and/or diverges
over time [1]. In addition, statistical analysis of known multidomain proteins has shown that
a) there is a strong bias for individual domains involved in recombination and fusion events
to be short [2], and b) some specific sets of recombined domains (supra-domains) participate
in further recombination and fusion events [3]. This provides a model for the dominant mode of
domain architecture evolution in proteins that is widely accepted. Recent work has further
estimated that between 88% and » 3% of all multidomain architectures have evolved through such
mechanisms [4-6]. The remaining architectures are likely to have evolved through convergent
evolution [6]. A recent theory proposes that, during major evolutionary transitions, evolution
is biphasic, further complicating the model of protein evolution [7]. According to this view,
in an initial post-transition phase, large scale horizontal gene transfer (HGT) would occur;
this would be followed by a second phase where the more common mechanisms for protein
evolution become dominant. Given this background, it is of interest to analyze the evolution
of a specific type of protein domain in order to assess the importance of the evolutionary
mechanisms described above in the evolution of that domain.

A protein domain of small size that is known to participate in the architecture of
multidomain proteins and is widespread over the many branches of the evolutionary tree would
be an appropriate choice to study. The glutaredoxin (GRX) domain meets all these conditions.
It has approximately one hundred amino acids, it is a part of several multidomain
architectures and it is present in archaea, bacteria and eukaryotes. GRXs are thiol
oxidoreductases responsible for the reduction of protein disulfides or glutathione-protein
mixed disulfides, which employ reduced glutathione (GSH) as hydrogen donor [8]. Together with
thioredoxins (TRXs) and other proteins, GRXs are grouped in the thioredoxin fold superfamily,
because they share a common structural fold consisting of a four or five-stranded β-sheet
flanked by several α-helices on either side of the β-sheet [9]. Functionally, TRXs are also
disulfide oxidoreductases. In contrast with GRXs, oxidised TRXs are reduced by thioredoxin
reductases at the expense of NADPH [8]. Alternatively, in plant chloroplasts, TRXs are reduced
directly by the ferredoxin/ferredoxin reductase system that is coupled to photosynthesis [10].
Three GRX families have been defined based on the sequence of the putative active sites.
Classical dithiol GRXs (CPYC class) with a C[P/S][Y/F]C active site sequence are widespread in
archaea, bacteria and eukaryotes. On the other hand, monothiol GRXs (CGFS class) contain a
CGFS-like active site sequence, and they have currently been reported to exist in bacteria and
eukaryotes [11]. Finally, land plants contain (in addition to GRX species of the above
classes) a third class of GRXs (CCMC class), with the sequence CC[M/L][C/S] in the putative
active sites [12,13]. This division in three groups is further complicated by the recent
characterization in Saccharomyces cerevisiae of three GRXs (Grx6, Grx7 and Grx8) with CSYS,
CPYS and CPDC active site motifs [14-16].
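
The three active-site classes described above can be expressed as simple sequence patterns.
The sketch below is only an illustration of that classification, not a procedure taken from
this work: the function name and the regular expressions are assumptions, and the third
position of the CPYC pattern is taken here to be Y or F.

```python
import re

# Active-site patterns following the class definitions in the text.
# The regular expressions are an illustrative assumption, not the authors' method.
GRX_CLASS_PATTERNS = [
    ("CCMC", re.compile(r"CC[ML][CS]")),
    ("CPYC", re.compile(r"C[PS][YF]C")),
    ("CGFS", re.compile(r"CGFS")),
]


def classify_active_site(motif):
    """Return the GRX class whose pattern matches a four-residue motif, else None."""
    motif = motif.upper()
    for name, pattern in GRX_CLASS_PATTERNS:
        if pattern.fullmatch(motif):
            return name
    return None
```

Under these patterns, motifs such as CSYS, CPYS and CPDC (reported for yeast Grx6, Grx7 and
Grx8) match none of the three classes, which mirrors the complication noted in the text.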



Multidomain proteins that contain GRX domains have also been reported in a variety of
different organisms [3,17-19]. The most studied GRX fusion proteins in eukaryotes are TRX-GRX
fusions containing CGFS class GRX modules. In these proteins, one to three GRX modules are
linked to an N-terminal TRX-like module which does not conserve the WC[G/P]PC active site of
functional TRXs. S. cerevisiae Grx3 and Grx4 and human GLRX3 (PICOT) [20] are examples of
these multidomain GRXs.

Most GRXs are likely to participate in a diversity of processes that require redox-type
regulation, although their specific roles and targets in those processes are often unclear.
Examples of processes in which CPYC class GRXs are involved include activation of
ribonucleotide reductase, or 3'-phosphoadenylyl sulfate reductase, reduction of ascorbate,
regulation of the DNA binding activity of nuclear factors, or protection against heavy metals
(reviewed in [21]). Examples of processes in which CGFS class GRXs are involved include iron
sulfur cluster biogenesis (Grx5), and regulation of cellular iron homeostasis (Grx3 and Grx4)
in S. cerevisiae [11,18,22], cytochrome biogenesis in bacteria [23,24], PHOKAR-(lipid
membrane) proteins in archaea [25], and regulation of signal transduction pathways in response
to external signals (the human PICOT) [20].

In this work we analyze the distribution and evolution of GRX domains in archaea, bacteria and
eukaryotes. We study over one thousand proteins, containing at least one GRX domain, from
hundreds of different organisms and trace the origin and evolution of the GRX domain within
the tree of life. Our phylogenetic analysis suggests that single domain GRX proteins of the
CGFS class have evolved through duplication and divergence from one initial gene that was
present in the last common ancestor (LCA) of all organisms. The same appears to hold true for
single domain GRX proteins of the CPYC class. We predict sets of residues that are likely to
be important for protein function in the CGFS and CPYC classes of GRXs. Remarkably, we
identify a case of convergent evolution in domain architecture, where two independent
recombination events of a TRX domain to a GRX domain are likely to have occurred. We also
identify domain combinations that, in the context of GRX domain evolution, appear to function
as the supra-domains proposed by Vogel et al. [3].

Results 

We have made a systematic identification of GRX domains from the UNIPROT database
(information summarized in Table 1). Over 75% of all GRX domains are found in single domain
proteins (Table 1). The other 25% are found as a part of a variety of multidomain proteins.

Table 1: Number of proteins used in this study*

                               Archaea          Bacteria         Eukaryotes
GRX classes                    CGFS    CPYC     CGFS    CPYC     CGFS    CPYC
Single domain proteins         5       92       452     543      31      186**
Multidomain proteins           0       28       14      223      94      47
Total number of domains***     5       120      466     766      125     233

* Alignments are provided as supplementary files 1 to 6, where UNIPROT IDs and web links are
also included for each protein. Many of the proteins that were found are predicted by homology
annotation of fully sequenced genomes. ** Out of these, 34 sequences belong to the CCMC class.
*** Grand total of 1717 GRX domains.




For example, GRXs have combined with pyridine nucleotide-disulphide
oxidoreductases class-II domains in
thioredoxin-glutathione-reductase proteins, or with pep- 
tide methionine sulfoxide reductase domains, among oth- 
ers. An interesting case is that of the triple fusion between 
the GRX domain and frataxin and rhodanese domains. In 
this case, the frataxin-rhodanese cassette appears to act as 
a supra-domain of recombination [3] in the context of
GRX domain evolution, because proteins where the GRX 
domain recombined exclusively either with the frataxin 
domain or with the rhodanese domain are not found. In 
addition to these and other well characterized protein 
domains, GRXs are also associated with other types of less 
well characterized protein domains (for example DUF296 
domains). Many of these domains are recurrently found 
in different proteins but have yet to be assigned a specific function. Table 2 details the
major domain types found to combine with the different GRX classes. (Additional Files 1, 2 and
3 contain the sequences and alignments for the GRX domains that have been identified in
multidomain proteins from the UNIPROT database.) Because of the large number of sequences
being analyzed we divided the sequences into smaller sets for a more detailed and accurate
analysis. We also analyzed the full set, with results that are similar to the ones described
below (data not shown).

Table 2: Type of major domains associated with GRXs in multidomain proteins.

GRX Classes / Domain type
Eukaryotes: CGFS; CPYC; CPYC
Domain types: TRX; Frataxin-Rhodanese; DUF296; Methionine sulfoxide reductase; Pyridine
nucleotide disulfide oxidoreductase; Peroxiredoxin

GRXs from single domain proteins

Figure 1A shows a condensed phylogenetic tree of single domain GRX proteins. Sequences of TRX
domains were used as outgroup for the tree for control purposes. CPYC class and CGFS class
GRXs segregate well in the phylogenetic tree. GRXs of CCMC class cluster together, but also
within the CPYC class (cluster 6.1 within cluster 6). The CCMC GRXs have only been identified
in higher plants (Embryophyta) and in single domain proteins. This observation suggests that
they may be an offspring or a subclass of the CPYC class of GRXs. We also find that the third
position in the putative active centre of CCMC GRXs is occupied by an uncharged amino acid
that is not necessarily a methionine residue, but can also be a branched amino acid. It is
noteworthy that some GRXs that have, overall, higher sequence similarity to GRXs of the CPYC
class contain active sites where the final cysteine residue has been replaced by other
residues. This is summarized in Figure 1A, where for example in cluster 6, GRXs with CGFS
active sites are present. In addition, there are a few proteins with high similarity to GRXs
that have completely lost their active site (cluster 3 in Figure 1A, e-values < 10^-5). The
function of these proteins is unknown, and this may represent a situation where the GRX
domain is being co-opted for a new function that has yet to be characterized.

Figure 1

Condensed phylogenetic tree for single domain GRXs from the UNIPROT database. Panel A: Global
tree. The outgroup on the lower left of the tree is composed of TRX single domain proteins (in
yellow). All major divisions between clusters have bootstrap values of 100%, indicated by the
"100" label. The CGFS class of GRXs is depicted on the upper branches [in green]. Sequences
are more homogeneous in this class. The mid-lower branches depict CPYC class GRXs [in mauve].
There is a wider variability in the sequences of this class. The clusters identified with the
tag "Bacteria*" include almost all non-proteobacteria GRX domains. These include domains from
Actinobacteria, Deinococcus, Planctomycetes, Green sulfur bacteria, Green non sulfur bacteria,
Thermotoga, Aquificaceae, and Flavobacteria. The cluster identified with the tag "Bacteria**"
includes mostly proteobacteria GRX domains. GRX domains from Cyanobacteria and Spirochetes are
also present in this cluster. Panel B: Condensed phylogenetic tree of single domain GRXs in
archaea. Panel C: Phylogenetic tree of GRX proteins for bacteria. CGFS class GRXs [Cluster 1]
have less variability in their sequence than CPYC class GRXs [other clusters]. Some GRX-like
proteins have lost their active site [XXXX in Cluster 2]. Panel D: Phylogenetic tree of GRX
proteins for eukaryotes. CGFS GRXs [in green] have less variability in their sequence than
CPYC GRXs [in mauve]. Nevertheless, the variability in the sequence of the active site for
CGFS GRXs is far greater than that found in bacteria [compare to panel C].

Within each class of GRXs, different taxa cluster roughly as previously reported (see for
example [26-28]). For example, in cluster 1 of Figure 1A bacterial GRXs of the CGFS class
group together, and apart from eukaryotic GRXs of the same class. This is consistent with
GRXs being present at the LCA of the three kingdoms and is inconsistent with massive HGT of
GRX genes between different kingdoms. In fact, CGFS class and CPYC class GRXs appear to have
been both present in the LCA of all branches in the tree, because both classes of GRXs
cluster apart and differentiation between archaea, bacteria, and eukaryotes is only observed
within the clusters shown in Figure 1A.

A group of proteins that have been annotated as GRX-like proteins in fully sequenced genomes,
mostly in archaea and eukaryotes but also in bacteria, appear to be somewhere in between TRXs
and CPYC class GRXs in terms of sequence similarity (clusters 8 and 9 in Figure 1A). They
have more variability in their sequence than the other clusters from Figure 1A, but they are
nevertheless similar to other well characterized GRXs, with e-value ≤ 10'°. They all contain
a GRX-like putative active site sequence. The bacterial sequences in this cluster come from
groups that are not typically considered as having GRXs (see caption for Figure 1A).

A detailed analysis of the data also reveals that CPYC class GRXs have been found in all
sequenced archaea genomes. However, within Archaea, we were only able to find CGFS class GRXs
in Halobacteriales. Five sequences of the CGFS class have been found in this group (see
Supplementary Table 1 in Additional File 4, Cluster 2 in Figure 1A and Cluster 4 in Figure
1B). A more detailed analysis of the DNA sequence for the genes that code for these GRXs
reveals the following.

a) The highest homology between the H. salinarium
CGFS class GRX and other non-archaea GRX is to a
Myxococcus xanthus GRX [see Supplementary Table 1 in
Additional File 4].

b) The codon usage of genes is a characteristic that can
often be used to identify HGT. This is so because the
evolution of an organism leads to an optimization of
the codon usage in genes for the specific physiology of
that organism [29]. We calculated the average codon
usage in the Myxococcus xanthus and in the H. salinarium
genomes [29]. We also calculated the codon
usage for the CGFS class grx genes from H. salinarium
and M. xanthus. We found that the codon usage in the
CGFS class GRXs of Halobacterium is similar to the average
codon usage in the Myxococcus genome. It is also
similar to the codon usage in the Myxococcus CGFS
class grx genes.

c) The borders of the CGFS class grx gene in H. salinarium
are homologous to the flanking regions of the
transposon gene XAC3504 from the gamma proteobacteria
Xanthomonas axonopodis.

d) The flanking regions of the CGFS class grx genes in
archaea could be degenerated palindromes. If this is
so, it may indicate the remains of a degenerated transposon.
Transposons are responsible for gene mobility
within and between genomes through HGT.
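
The codon usage comparison in point b) can be sketched as follows. This is a minimal
illustration under stated assumptions: the text does not say which similarity measure was
used, so the Euclidean distance between codon frequency profiles is an assumption made here,
and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Skip codons containing ambiguous bases such as N.
    counts = Counter(c for c in codons if set(c) <= set("ACGT"))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def usage_distance(u, v):
    """Euclidean distance between two codon-usage profiles (0 means identical).

    The choice of distance is an assumption; the source does not specify one.
    """
    keys = set(u) | set(v)
    return sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))
```

With profiles computed for a gene and for two genomes, a gene whose profile is closer to the
foreign genome than to its host genome would be a candidate for horizontal transfer, which is
the reasoning applied to the CGFS class grx genes above.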

Taken together, these observations suggest that CGFS class
GRXs in archaea may have been the result of one HGT
from some proteobacteria ancestor to the Halobacteriales.

Unlike in archaea, single domain CGFS class and CPYC class GRXs are both widespread in
bacteria, as shown in Figure 1C. Both classes of GRXs appear to have existed as such in the
LCA of bacteria, because in general CGFS class GRXs cluster together, as do CPYC class GRXs.
Interestingly, some GRXs that are, sequence-wise, clearly included in the CPYC class have
evolved CGFC putative active centers (cluster 6 in Figure 1C; also see Additional Files 1, 2,
3, 5, 6, and 7), although no CGFS class GRXs with a second cysteine in the active site were
found.

Figure 1D shows a phylogenetic tree based on the
sequence alignment of eukaryotic single domain GRXs.
CGFS class and CPYC class GRXs are also commonly
found in eukaryotes. A fraction of CPYC class GRXs have
lost the second cysteine residue of their putative active site
(GRXs in clusters 3, 5, 6 and 8).

Sequence analysis of individual GRX domains from an
evolutionary and functional perspective

We compared the variability of the CPYC class alignments 
to that of the CGFS alignments. This variability can be
quantified by calculating both the average positional
entropy of the alignment and the normalized mutual 
information (NMI) between each pair of positions in the 
alignment (see Methods for details). 

The higher the average positional entropy is, the higher the variability per position in the
alignment is (see e.g. [30]). Based on this we find that the CPYC class has higher sequence
variability than the CGFS class. Although the difference in variability between the two
classes may be due to the larger number of sequences that have been identified for the CPYC
class, we believe that this is not the case, because the standard deviation of the positional
entropy in the two classes is almost the same (approximately 0.8 for the CPYC class and
approximately 0.75 for the CGFS class, Figure 2).
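
As an illustration of the quantity discussed above, the positional entropy of an alignment
can be computed column by column. The sketch below assumes Shannon entropy in bits and ignores
gap characters; the exact definition used in this work is given in the Methods, so both
choices, and the function names, are assumptions.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column, gaps ignored."""
    residues = [r for r in column if r not in "-."]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum((c / n) * log2(c / n) for c in Counter(residues).values())


def positional_entropies(alignment):
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy(col) for col in zip(*alignment)]


def mean_and_sd(values):
    """Mean and (population) standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5
```

A fully conserved column gives an entropy of zero, and a column split evenly between two
residues gives one bit, so a higher average over all columns corresponds to a more variable
alignment.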

The results from NMI profiles support those from the averaged position entropy, although the
interpretation of these profiles is more nuanced. On one hand, if two positions in the
alignment have a NMI that is very high, one can interpret this result as indicating that any
changes in the residue at one of the positions needs to be counterbalanced by a compensatory
change in the residue of the other position [30-35]. Thus, the residues in those positions
are functionally constrained and functionally important for the protein. On the other hand,
positions with residues that are highly conserved will have a NMI with any other position in
the alignment that is not significantly different from zero. Therefore, highly conserved
positions are also likely to be functionally constrained. All the information regarding
conservation and co-variation of residues is summarized in Figure 2. Many of the residues
predicted in Figure 2 as being functionally important are known to be involved in the overall
function of the GRX domains. For example, residues in the putative active site of both classes
of GRXs are marked with a blue transparent rectangle. In the same figure, one can see that
CGFS class GRXs have a larger number of positions in the alignment that either have high NMI
interaction with other positions (black lines and black residues) or are highly conserved (red
lines and red residues). If our results are general and our interpretation is correct, GRXs
of the CGFS class may require more positions to be constrained, if they are to remain
functional.
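
The behaviour of NMI described in the preceding paragraph (high for co-varying positions,
zero for conserved ones) can be reproduced with a short sketch. Several normalizations of
mutual information are in use; dividing by the joint entropy, as done here, is an assumption
and may differ from the definition in the Methods. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (bits) of a list of hashable symbols."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def normalized_mutual_information(col_x, col_y):
    """Mutual information of two alignment columns divided by their joint entropy.

    Sequences with a gap in either column are skipped. The normalization by
    joint entropy is an assumed choice.
    """
    pairs = [(x, y) for x, y in zip(col_x, col_y) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    h_x = entropy([x for x, _ in pairs])
    h_y = entropy([y for _, y in pairs])
    h_xy = entropy(pairs)
    if h_xy == 0:
        return 0.0  # both columns invariant: no information to share
    return (h_x + h_y - h_xy) / h_xy
```

With this definition two perfectly co-varying columns give a value of one, while a fully
conserved column gives zero against any partner, which is why conservation and co-variation
have to be read together.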

An alternative explanation for the differences in the variability between CPYC and CGFS GRXs
is the following. The larger variability of CPYC GRXs could also be observed if CPYC class
GRXs had an earlier origin than CGFS GRXs, because this could have allowed for CPYC GRXs to
have evolved for a longer time. Our results suggest that this explanation is unlikely. In
Figure 1, the CPYC and CGFS classes of GRXs cluster perfectly apart, and the branching
structure of the tree indicates that both classes were already present in the LCA of archaea,
bacteria and eukaryotes. This suggests that both proteins may have had, roughly, a similar
amount of time to evolve. The DNA sequences of the different domains can be further analyzed
in order to assess if the two classes of GRXs have been evolving for significantly different
times. It is known that the rate of synonymous mutations in different genes is similar, even
when the rate of non-synonymous mutations is quite different [33]. Therefore, by comparing
the average percentage of synonymous substitutions per codon between the CGFS class and the
CPYC class GRXs in the conserved positions of each multiple alignment one can obtain
additional information regarding whether these proteins have been evolving for approximately
the same time. This percentage is similar in both classes of proteins. In addition, we
calculated the average ratio of non-synonymous (Ka) over synonymous (Ks) mutations per codon
to be <Ka/Ks> ≈ 2.9 for the GRX domains of each of the two classes. This ratio can also be
used to estimate the rate of evolution of proteins [30]. Taken together, all these data are
consistent with the notion that the difference in the number of highly conserved positions in
both classes of GRXs is not due to a difference in the time they had to evolve. The protein
and DNA sequence alignments are provided (see Additional Files 1, 2, 3, 5, 6, and 7).
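
The distinction between synonymous and non-synonymous substitutions used above can be made
concrete with a crude counting sketch. It only classifies differing codons between two
aligned, in-frame coding sequences using the standard genetic code; it does not normalize by
the number of synonymous and non-synonymous sites or correct for multiple substitutions, so it
is not a full Ka/Ks estimate and is not the procedure used in this work. The function name is
hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions(seq_a, seq_b):
    """Count differing codons as (synonymous, non_synonymous).

    Both sequences are assumed to be in frame and aligned without gaps;
    codons with unknown bases are skipped.
    """
    syn = nonsyn = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        a, b = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

For example, GCT and GCC both encode alanine and count as a synonymous difference, whereas
AAA (lysine) against AGA (arginine) counts as non-synonymous.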

GRX domains in multidomain proteins

As stated above, and summarized in Table 2 and Figure 3, multidomain proteins that contain GRX
domains are widespread. To more accurately analyze these GRX domains we isolated their
sequence from the multidomain proteins, as described in the Methods section. The GRX domains
were then aligned using MEGA4. The multiple alignment was used to build a phylogenetic tree. A
condensed version of this tree is shown in Figure 4. It is striking that, with few
exceptions, all GRX domains from proteins containing a specific type of domain combination
are, sequence-wise, closely related amongst themselves. They cluster together in the
phylogenetic tree and apart from GRX domains extracted from other types of proteins. The
simplest interpretation of the data presented in Figure 4 is that a specific type of
GRX-containing multidomain proteins is descendent from an original gene fusion event. This
event is likely to have occurred before the LCA of all organisms containing the relevant type
of multidomain protein (Figure 3). If independent gene fusion events between different GRX
domains and a given protein domain occurred at different stages of the evolutionary process,
then one would not expect the isolated GRX domains from these fusion proteins to cluster
together and apart from all other GRX domains.

Figure 2

Functional importance of residues in the different positions of the alignment. Sequences were
divided into two classes (CGFS and CPYC). Then, they were further divided into Archaeal,
Bacterial and Eukaryotic sequences. Because archaea only have a small number of GRX proteins
of the CGFS class, we show the results for archaea GRXs of the CPYC class only. Subsequently,
Normalized Mutual Information (NMI) and Position Entropy were calculated for the different
alignments. Vertical red lines in the plots of the first and second column indicate positions
in the alignment that are highly conserved. Horizontal black lines in the plots of the first
and second columns indicate positions in the alignment that have high NMI with at least one
other position. Numbers in columns one and two indicate the position in the alignment. First
column - CPYC class GRXs. Second column - CGFS class GRXs. Third and fourth columns - Detailed
positions in the alignments that are colored, with the consensus residues found in those
alignment positions. The putative active centers are shaded in blue. Average entropy per
residue is also shown in columns three and four.

Figure 3

Schematic representation for the distribution of major groups of multidomain proteins
containing GRX domains in the tree of life. The most ancient fusion appears to be that of GRX
domains to pyridine nucleotide disulphide oxidoreductase domains because it is present in all
kingdoms. TRX-GRX fusions may also have been present in the LCA (Last Common Ancestor) of all
three branches. Proteins standing at the end of a node are the ones identified in current
organisms. Proteins standing on top of the branches of the tree and surrounded by upper and
under lines that are dashed suggest when that type of protein may have first originated. All
domains identified as unknown in the figure have not been identified in any of the available
domain databases. All domains identified as PFAM in the figure have been identified in PFAM
as associated to proteins of unknown function. Panel A - Non-TRX-GRX fusion proteins. Panel B
- TRX-GRX fusion proteins. TRX-GRX fusion proteins exist in all three major branches of the
tree of life. Duplication of the GRX domain appears to have occurred early in the eukaryotic
life history. Some TRX-GRX proteins appear to have undergone two consecutive partial
duplication/recombination events of the GRX domain after the initial TRX-GRX fusion.
Distribution of TRX-GRX fusion proteins may result from a) multiple independent deletions of
the TRX-GRX fusion in different bacterial branches and in the eukaryotic branch, followed by
a new TRX-GRX fusion event in the eukaryotes (Scenario A, red lines), b) two independent
TRX-GRX fusion events, one in archaea and one in eukaryotes, followed by horizontal gene
transfer from archaea to bacteria (Scenario B, dashed mauve line), or c) TRX-GRX fusion
originating in the ancestor of some bacterial lineages, followed by horizontal gene transfer
to the ancestor of Archaea and an independent TRX-GRX fusion event in eukaryotes (Scenario C,
dashed black line). See Discussion for further details.



Figure 4

Phylogenetic tree of GRX domains from multidomain proteins. CGFS class GRX domains [in green]
have less variability in their sequence than CPYC class GRX domains [in mauve]. Note that GRX
domains from NrdH-like proteins cluster closer to GRX domains from TRX-GRX fusion proteins,
and within the CPYC class. This is consistent with previous results from biochemical assays
and sequence analysis that suggest NrdH-like proteins to cluster between TRXs and GRXs [51].

Only the GRX-pyridine nucleotide disulphide oxidoreductase fusion and the TRX-GRX fusion are
found in all three branches of the tree of life. All other multidomain proteins that contain
a GRX domain are found only in specific branches of that tree (Figure 3A). For example,
peroxiredoxin-GRX fusions are found so far only in proteobacteria. TRX-GRX multidomain
proteins appear to have a more complex history. In eukaryotes, all GRX domains in TRX-GRX
fusion proteins are of the CGFS class. In contrast, they are of the CPYC class in bacteria and
archaea. This suggests that two independent recombination events took place to form the
original TRX-GRX fusion proteins. In addition, TRX-GRX fusion proteins have undergone further
domain shuffling in eukaryotes. An analysis of the phylogenetic tree in Figure 4 suggests that
a duplication of the GRX domain occurred in TRX-GRX proteins, forming TRX-GRX-GRX proteins
before fungi, plants and animals separated. A more recent duplication-recombination event of
the GRX domain is present in TRX-GRX-GRX-GRX proteins found in plants and in some protists.
Based on how the different GRX domains cluster in the phylogenetic tree, we cannot exclude
that the duplication events in plants and protists are independent. However, the bootstrap of
the tree for the separate clustering of GRX domains from protists and plants is lower than 50%
(Figure 4, clusters 1 and 6). This means that only in less than 50% of the trees built using
bootstrap do we find separation of the protists cluster and the plant cluster. Such a fact
suggests that this final recombination event may have taken place either a) in a common
ancestor of plants and protists or b) in one of the branches, followed by HGT to the other
branch. Nevertheless, it is of interest that the TRX-GRX multidomain proteins may have gone
through several cases of convergent evolution of domain architecture.
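
The meaning of a bootstrap value below 50% can be illustrated with a small resampling sketch.
Alignment columns are resampled with replacement and the fraction of replicates in which two
groups of sequences separate is reported. Building a tree for each replicate is replaced here
by a much simpler, hypothetical separation test based on Hamming distances, so this shows the
resampling logic only and is not the tree-based procedure behind Figure 4.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(alignment[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def groups_separate(alignment, group_a, group_b):
    """Hypothetical stand-in for 'the two clusters are separate in the tree':
    every within-group distance is smaller than every between-group distance."""
    within = [hamming(alignment[i], alignment[j])
              for g in (group_a, group_b) for i in g for j in g if i < j]
    between = [hamming(alignment[i], alignment[j])
               for i in group_a for j in group_b]
    return max(within, default=0) < min(between)


def bootstrap_support(alignment, group_a, group_b, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which the two groups separate."""
    rng = random.Random(seed)
    hits = sum(
        groups_separate(bootstrap_alignment(alignment, rng), group_a, group_b)
        for _ in range(replicates)
    )
    return hits / replicates
```

A support value under 0.5 from such a procedure means that most resampled data sets fail to
separate the two groups, which is the sense in which the plant and protist clusters are not
reliably distinguished.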

In eukaryotes, most GRX domains in multidomain proteins are of the CGFS class (Table 2 and
Figure 4). The exceptions are the CPYC class GRX domains recombined with pyridine nucleotide
disulphide oxidoreductase domains. In contrast, in prokaryotes, most GRX domains in
multidomain proteins are of the CPYC class (Table 2 and Figure 4). GRX-containing protein
architectures that are specific to bacteria are:

a) Proteins with a domain structure of the type peroxiredoxin-GRX,
with the GRX domain being of the
CPYC class. Peroxiredoxins are thiol dependent peroxidases
involved in cell protection against oxidative
stress.

b) Proteins with a domain structure DUF296-GRX.
DUF296 is a domain of unknown function that contains
what appears to be a zinc finger like motif [37].
This suggests that these proteins may be involved in
DNA binding, probably acting in regulation of gene
expression. The GRX domain is of the CPYC class.

c) Proteins with a domain structure frataxin-rhodanese-GRX.
Single domain frataxin proteins are
involved in iron storage and metabolism and in iron-sulfur
cluster biogenesis [22,38]. The acidic aspartate
and glutamate residues that are responsible for iron
binding in the frataxin protein are, for the most part,
conserved in the fusion proteins, suggesting that these
domains may remain at least partially functional. On
the other hand, proteins containing rhodanese
domains participate in sulfur metabolism [39]. Again,
the cysteine and glycine residues of the rhodanese
active center are for the most part conserved in the
fusion proteins, suggesting that the rhodanese
domains may remain at least partially functional.
Considering that S. cerevisiae Grx5 is implicated in
mitochondrial maturation of iron-sulfur clusters [18],
this suggests a connection between the complex
frataxin-rhodanese-GRX proteins and iron-sulfur
(cluster) metabolism.

Discussion 

GRX domains are a part of the thioredoxin-fold
superfamily. This superfamily includes domains of several
classes, such as TRXs, PRXs (peroxiredoxins), GSTs
(glutathione S-transferases) and GRXs, among others. Each of
these domain classes is involved in different types of
redox reactions that follow dissimilar mechanisms. In this
work we analyzed GRX protein domains from a large
number of organisms in the tree of life. We confirm that
GRXs are widespread over the three major kingdoms of
life, both as individual proteins and as domains of larger
proteins. We detect no signal of extensive HGT of GRXs
among the different taxa, except for GRXs of the CGFS
class between bacteria and Halobacteriales (archaea).
Because GRXs are present in archaea, bacteria and
eukaryotes, our results suggest that both the CPYC class and
the CGFS class of GRXs were already present in the LCA. This
inference follows from accepting the proposal that
bacteria diverged first, followed by the divergence between
archaea and eukaryotes [40]. If one accepts this view, it
follows that archaea and many bacterial branches have
lost GRXs of the CGFS class somewhere in the early stages
of their evolution.



Our analysis of the available data leads us to speculate
about the origin of GRXs. One should recall that GRXs
and TRXs belong to the same protein superfamily because
they have a common structural fold. In addition, when
BLAST is used to search for GRX sequence homologues
in the UNIPROT database, TRX domains are found to be the
closest relatives of GRX domains (data not shown). Taking
these two facts together, one could make the case that
both types of domains may have originated from the same
common ancestor gene. If this is so, then the clusters in
Figure 1A may be indicative of how TRXs and GRXs have
diverged. Initially, a TRX ancestor may have been
duplicated well before the LCA. After divergence, this
duplicated TRX ended up becoming the ancestor of the GRX
domain proteins. At this stage, further duplication events,
either of the GRX domain or of the TRX domain, may have
led to the formation of the different GRX classes. The
branching structure of the tree shown in Figure 1A,
together with the active site information that is also
shown in that figure, leads us to further speculate that
sequences from clusters 8 and 9 may be fossil sequences
that are remnants of intermediate steps in the divergence
of the original GRX domain towards current day GRXs of
the CGFS and CPYC classes. If one accepts this picture,
then it follows that the original ancestor of all GRXs is of
the dithiolic type (that is, with a CXXC active center) and
that CGFS class GRXs would be more recent than CPYC
class GRXs, even though both may have been present at
the LCA of archaea, bacteria and eukaryotes. Consistent
with this view, CPYC class GRXs sometimes lose the
second cysteine in their active site, while no GRX with two
cysteines in its active site was found in clusters with CGFS
class GRXs.

If the picture described in the previous paragraph is
correct, CGFS class GRXs would be the more specialized
GRXs, which could entail having stronger functional
constraints against mutations in their sequence. Our analysis
finds that GRXs of the CGFS class indeed have a higher 
percentage of conserved residues than those of the CPYC 
class and that the average residue variability, over the 
stretches of the proteins where residue conservation is 
low, is similar in the CGFS and CPYC classes. This further 
supports the notion that the lower sequence diversity in 
the CGFS class may be due to this class having functional 
constraints on a larger set of residues than the CPYC class. 
A deeper understanding of the catalytic mechanisms of 
the two GRX classes and of the protein dynamics during 
catalysis would be necessary if one is to confirm this spec- 
ulation and explain it in a mechanistically rational way. 
Nevertheless, enzymatic studies already indicate that the
mechanisms of action of the two types of GRXs are different
[11,17].

GRX modules as parts of multidomain proteins are
widespread. Figure 3 summarizes the broad phylogenetic
distribution of these proteins. The only GRX-containing
multidomain architecture that is common to the three
kingdoms (with the same class of GRX module) is that
found in GRX-pyridine nucleotide disulfide oxidoreductase
proteins. This suggests that this type of fusion protein may
be ancestral to the divergence between the three kingdoms
of life.

The question arises as to the biological advantages of such
domain fusions. In the case of S. cerevisiae Grx3, the TRX
domain seems to be required for the nuclear targeting of
the molecule [41], without having enzyme activity as a
redoxin. However, in other cases both domains might
retain the original enzyme activities and be involved in
different stages of the same biological process. Domain
combinations involving redoxin modules could be an
evolutionary strategy to incorporate into a single peptide
the enzymatic redox activity and the target of such redox
control. The case of the fusions between GRX domains of
the CGFS class and frataxin-rhodanese domains could be
an example of this situation. The participation of CGFS
class GRXs in the synthesis of iron-sulfur clusters is well
known to occur in eukaryotes [18], although such
participation has not been demonstrated in bacteria to date.

An interesting observation is that while TRX-GRX proteins
in eukaryotes have GRX modules of the CGFS class,
TRX-GRX proteins in bacteria and archaea have GRX modules
of the CPYC class. This is consistent with the fusion
between TRX and GRX modules having occurred
independently in evolution at least twice. Furthermore, the
TRX-GRX fusion proteins in eukaryotes have undergone
further evolution, giving rise to proteins where the GRX
module has been duplicated once or twice. The
observations are consistent with the following scenarios
(Figure 3B):

a) Scenario A: TRX-GRX fusion, with GRX belonging 
to the CPYC class, present at the LCA of the three king- 
doms. TRX-GRX fusion lost in the eukaryotic branch, 
with a subsequent, independent, new TRX-GRX fusion 
event occurring in eukaryotes, this time with a GRX of 
the CGFS class.

b) Scenario B: TRX-GRX fusion, with GRX belonging
to the CPYC class, occurring in ancestral archaea,
followed by HGT of this protein to the common ancestor
of several bacterial lineages. Independent TRX-GRX
fusion event occurring in eukaryotes, this time with a
GRX of the CGFS class.

c) Scenario C: TRX-GRX fusion, with GRX belonging
to the CPYC class, occurring in ancestral bacteria,
followed by horizontal gene transfer of this protein to the
common ancestor of archaea. Independent TRX-GRX
fusion event occurring in eukaryotes, this time with a
GRX of the CGFS class.

We do not have enough data to distinguish between the
three hypotheses. However, Occam's razor suggests that
scenario B may be the most likely, because that is the
scenario where the smallest number of events would have had
to occur. This would point to the TRX-GRX hybrid
proteins as an interesting example of convergent evolution in
domain architecture.

Conclusion 

In this study we trace the origin of glutaredoxins to the 
LCA of archaea, bacteria and eukaryotes. We propose 
probable patterns of evolution for the different GRX 
classes and trace the origin and evolution of recombination
events between the GRX domain and other protein
domains. We find an interesting case of convergent
evolution in the domain architecture of TRX-GRX proteins.

Methods 

Retrieval and curation of GRX sequences

We used the UNIPROT database [42] to retrieve all
sequences analyzed in this study. We downloaded the
sequences of all proteins containing a GRX domain in
FASTA format [43]. These proteins were identified
using PSI-BLAST [44]. As query sequences for the BLAST
search we used all GRX domain sequences from S.
cerevisiae, Escherichia coli and Halobacterium salinarium. All
cDNA sequences corresponding to the UNIPROT entries
were retrieved from GenBank [45]. Once all GRX
domains were identified, we analyzed them in isolation.

Because a large number of very similar sequences were 
identified, we used the "Decrease Redundancy" program 
at SWISSPROT [46] to reduce the number of sequences to 
analyze. We set the algorithm to eliminate all sequences 
that had more than 90% identity to the sequence that was 
retained in the set. We then cross-checked the eliminated 
sequences and re-introduced any sequence from an organ- 
ism for which no close relative with a similar sequence 
had been retained. This was done using a local PERL 
script. Summary statistics of the analyzed sequences are
shown in Table 1.
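
The redundancy filter described above can be illustrated with a short sketch. This is not the "Decrease Redundancy" program itself, whose internal algorithm is not described here. It is a minimal greedy filter that assumes the sequences have already been aligned to equal length, and the function names `percent_identity` and `reduce_redundancy` are hypothetical. The re-introduction step for organisms left without a representative is not covered.

```python
from __future__ import annotations


def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where both sequences carry a gap character.
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def reduce_redundancy(seqs: dict[str, str], threshold: float = 90.0) -> list[str]:
    """Greedy filter (an assumed strategy): keep a sequence only if it is not
    more than `threshold` percent identical to any sequence already kept.
    Returns the identifiers of the retained sequences in input order."""
    kept: list[str] = []
    for name, seq in seqs.items():
        if all(percent_identity(seq, seqs[k]) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With a greedy filter of this kind the retained set depends on the input order, which is one reason a cross-check of the eliminated sequences, as described above, is useful.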

Sequence alignments and building of phylogenetic trees

MEGA4 [47] was used to build sequence alignments and
the phylogenetic trees shown in Figure 1. All trees
presented here were built using a minimum evolution model
and bootstrapped one thousand times. Dendroscope [48]
was used for tree analysis and representation. TRX domain
sequences were used for control purposes as an outgroup
for the tree-building alignments. As a control we have also
built trees from the same set of sequences using
neighbor-joining methods and maximum likelihood methods. The
resulting trees were similar to those shown in Figures
1 and 4.

Domain identification in multidomain proteins

Multidomain proteins were identified using the Domain
Fishing server [49] and PROSITE [50]. The GRX domains
as identified by PROSITE were then manually excised
from the longer proteins. Whenever a domain was
exclusively identified by the Domain Fishing server, that
domain was then manually excised from the longer
proteins.

Analysis of residue conservation in alignments

Mutual information between different position pairs in an
alignment is indicative of how much the residue variation
in one position constrains the residue variation in the
other position. It was calculated using the formula

\[ MI(i,j) = \sum_{m} \sum_{k} f(AA[i,m], AA[j,k]) \, \log \frac{f(AA[i,m], AA[j,k])}{f(AA[i,m]) \, f(AA[j,k])} \]

where f(AA[i,m]) represents the relative frequency of
amino acid of type m in position i of the alignment and
f(AA[i,m], AA[j,k]) represents the joint relative
frequency of that amino acid in position i and of amino
acid of type k in position j of the alignment. The higher the
covariance between any two positions, the higher the
mutual information between those two positions will be.
For representation purposes in Figure 2, we define
normalized mutual information as

\[ NMI(i,j) = \frac{MI(i,j)}{\mathrm{Max}[MI(\mathrm{alignment})]} \]

where Max[MI(alignment)] is the maximum MI between any
two positions in the alignment. The higher NMI(i, j) is,
the higher the covariance between positions i and j will be.
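
As an illustration of the mutual information calculation described above, the following sketch computes MI for every pair of alignment columns and normalizes by the largest value found. It is not the Mathematica code used in this work. The function names are hypothetical, the natural logarithm is an assumption (the base only rescales MI and cancels in the normalized value), and gap characters are simply treated as one more symbol.

```python
from __future__ import annotations

from collections import Counter
from itertools import combinations
from math import log


def mutual_information(col_i: str, col_j: str) -> float:
    """MI between two alignment columns, each given as a string with one
    character per sequence (natural logarithm assumed)."""
    n = len(col_i)
    f_i = Counter(col_i)               # counts of each residue in position i
    f_j = Counter(col_j)               # counts of each residue in position j
    f_ij = Counter(zip(col_i, col_j))  # counts of each residue pair
    mi = 0.0
    for (m, k), count in f_ij.items():
        joint = count / n
        mi += joint * log(joint / ((f_i[m] / n) * (f_j[k] / n)))
    return mi


def normalized_mi(alignment: list[str]) -> dict[tuple[int, int], float]:
    """NMI(i, j) = MI(i, j) / max MI over all position pairs (0-based indices)."""
    columns = ["".join(col) for col in zip(*alignment)]
    mi = {(i, j): mutual_information(columns[i], columns[j])
          for i, j in combinations(range(len(columns)), 2)}
    max_mi = max(mi.values(), default=0.0)
    if max_mi <= 0.0:
        # No covariation anywhere: report zero for every pair.
        return {pair: 0.0 for pair in mi}
    return {pair: value / max_mi for pair, value in mi.items()}
```

For the toy alignment `["AAC", "AAD", "GGC", "GGD"]`, the first two columns vary together and the third varies independently of both, so the pair (0, 1) receives the maximal normalized value and the pairs involving column 2 receive zero.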

Positional entropy measures how high or low the variability
is at a given alignment position. Positional entropy for
position k in an alignment was calculated using the formula

\[ S_k = -\sum_{j} P_{jk} \ln(P_{jk}) \]

where P_{jk} is the frequency of amino acid j in alignment
position k.
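
The positional entropy calculation can be sketched in a few lines. This is an illustrative example rather than the original Mathematica code, and the function names are hypothetical. Gap characters, if present, are counted as an additional symbol, which is an assumption.

```python
from __future__ import annotations

from collections import Counter
from math import log


def positional_entropy(column: str) -> float:
    """Entropy (natural log) of one alignment column, given as a string
    with one character per sequence."""
    n = len(column)
    return -sum((c / n) * log(c / n) for c in Counter(column).values())


def alignment_entropies(alignment: list[str]) -> list[float]:
    """Entropy of every column of an alignment given as equal-length strings."""
    return [positional_entropy("".join(col)) for col in zip(*alignment)]
```

A fully conserved column gives an entropy of zero, and a column split evenly between two residues gives ln 2, so higher values indicate more variable positions.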

The ratio Ka/Ks, where Ka stands for non-synonymous
substitutions and Ks stands for synonymous substitutions in
the DNA, can be used to compare how fast different proteins
are evolving [36]. We use this ratio to compare the rate of
evolution between CPYC class GRXs and CGFS class GRXs. All
calculations were done using Mathematica.